dir <- "~/work/courses/stat380/weeks/week-5"
setwd(dir)
library(renv)
# renv::activate()
renv::restore()
* The library is already synchronized with the lockfile.
Tue, Feb 7
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i
\]
where:
\(y_i\) is the response
\(x_i\) is the covariate
\(\epsilon_i\) is the error (vertical black line in lecture 4 notes)
\(\beta_0\) and \(\beta_1\) are the regression coefficients
\(i = 1, 2, \dots, n\) are the indices for the observations
Can anyone tell me the interpretation for the regression coefficients?
\(\beta_0\) is the intercept and \(\beta_1\) is the slope.
Let’s consider the following example using mtcars:
library(ggplot2)
library(dplyr)  # provides the %>% pipe
mtcars %>% head()
A data.frame: 6 × 11

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
Consider the following relationship:
x <- mtcars$hp
y <- mtcars$mpg
plot(x, y, pch = 20, xlab = "HP", ylab = "MPG")
model <- lm(y ~ x)
summary(model)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-5.7121 -2.1122 -0.8854 1.5819 8.2360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
x -0.06823 0.01012 -6.742 1.79e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.863 on 30 degrees of freedom
Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
For the intercept this means that:
A hypothetical car with \(\texttt{hp} = 0\) has an expected mpg of \(\hat{\beta}_0 \approx 30.10\)
It’s more interesting and instructive to consider the interpretation of the slope:
Let’s say we have some covariate value \(x_0\); then the expected value of \(y\) at \(x_0\) is given by
\[
\mathbb{E}[y \mid x = x_0] = \beta_0 + \beta_1 x_0,
\]
so increasing the covariate by one unit changes the expected response by exactly the slope:
\[
\mathbb{E}[y \mid x = x_0 + 1] - \mathbb{E}[y \mid x = x_0] = \beta_1.
\]
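We can check this interpretation of the slope numerically. A short sketch (my own illustration, reusing the mtcars fit from above; the choice \(x_0 = 100\) is arbitrary):

```r
# Fit the same simple regression of mpg on hp
model <- lm(mpg ~ hp, data = mtcars)

# Predicted mpg at x0 and at x0 + 1
x0 <- 100
pred <- predict(model, newdata = data.frame(hp = c(x0, x0 + 1)))

# The difference in predictions is exactly the slope coefficient
unname(diff(pred))   # approx -0.06823, matching the summary() output
coef(model)[["hp"]]
```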
Even if \(x\) is categorical, we can still write down the regression model as follows:
where \(x_i \in \{ \texttt{setosa}, \ \texttt{versicolor}, \ \texttt{virginica} \}\). This means that we end up with (fundamentally) three different models: one expected response for each species.
A tibble: 6 × 11

  income limit rating cards age education own student married region balance
  14.891  3606    283     2  34        11  No      No     Yes  South     333
 106.025  6645    483     3  82        15 Yes     Yes     Yes   West     903
 104.593  7075    514     4  71        11  No      No      No   West     580
 148.924  9504    681     3  36        11 Yes      No      No   West     964
  55.882  4897    357     2  68        16  No      No     Yes  South     331
  80.180  8047    569     4  77        10  No      No      No  South    1151
and we’ll look at the following three columns: income, rating, and limit.
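As a starting point for exploring those three columns (a sketch, assuming the Credit data comes from the ISLR2 package, where the column names are capitalized as Income, Rating, Limit):

```r
library(ISLR2)  # assumed source of the Credit data

# Pairwise scatterplots of the three columns of interest
pairs(Credit[, c("Income", "Rating", "Limit")], pch = 20)

# Their pairwise correlations; Rating and Limit in particular are
# very strongly correlated, which matters once we regress on both
cor(Credit[, c("Income", "Rating", "Limit")])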